[SPARK-12340][SQL]fix Int overflow in the SparkPlan.executeTake, RDD.take and AsyncRDDActions.takeAsync #10562
Conversation
Test build #2307 has finished for PR 10562 at commit
I don't think a blank line is needed here.
I have removed some blank lines.
@QiangCai the problem isn't blank lines but whitespace at the end of your lines.
@srowen I have removed the trailing whitespace.
Test build #2310 has finished for PR 10562 at commit
@srowen I have no idea how to resolve this unit test failure. Would you help me?
@QiangCai I think the test failures are unrelated. However, before we can retest you'll have to rebase, as there is a merge conflict now.
…take and AsyncRDDActions.takeAsync
@srowen I have rebased from master and resolved all conflicts.
This shouldn't be here
I have removed it.
Test build #2325 has finished for PR 10562 at commit
@srowen I have found some error messages in the test build log: an OutOfMemoryError has occurred. The code at line 71 of AsyncRDDActions.scala is "val results = new ArrayBuffer[T](num)"; because the parameter num (2147483638) is so large, the JVM cannot allocate enough memory. Error messages:
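For anyone reading along, here is a minimal sketch of why that allocation fails (illustrative only, not the actual Spark source; the names and values follow the comment above). ArrayBuffer's initial-size constructor allocates its backing array eagerly, so a num near Int.MaxValue demands billions of reference slots before anything is appended.

```scala
import scala.collection.mutable.ArrayBuffer

// Illustrative sketch: new ArrayBuffer[T](initialSize) eagerly allocates a
// backing Array[AnyRef] of that length. With num = 2147483638 that is about
// 2.1 billion references (roughly 17 GB), so most JVMs throw
// java.lang.OutOfMemoryError before a single result is collected.
object PreSizedBufferDemo {
  def main(args: Array[String]): Unit = {
    val num = 2147483638                    // the value reported in the test log
    val results = new ArrayBuffer[Int](num) // OutOfMemoryError on any normal heap
    println(results.length)
  }
}
```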
Why the instance of …
If you have a chance to modify this again, please insert a space between ) and {.
I will do it.
@sarutak Maybe we have found another bug. I will try to fix it.
I have removed the initial size num. The initial size will now be the default value of 16, the same as in RDD#take, which should be fine.
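A sketch of the change being described (illustrative, not the exact patch): drop the explicit initial size so the buffer starts at the default capacity of 16 and grows as partial results arrive, which matches what RDD#take does.

```scala
import scala.collection.mutable.ArrayBuffer

// Illustrative sketch: with no initial size the buffer starts at the default
// capacity (16) and grows geometrically on append, so a huge requested `num`
// no longer forces a huge up-front allocation.
def newResultBuffer[T](): ArrayBuffer[T] = new ArrayBuffer[T]()
```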
Test build #2326 has finished for PR 10562 at commit
I think I have resolved this problem.
LGTM
Merging this into master.
@QiangCai We have many conflicts against branch-1.6.
OK. I have created another PR #10619 to merge this code into branch-1.6.
Why is this change necessary? When can partsScanned go above 2B?
Ah, you're right. partsScanned cannot exceed the value of totalParts.
I'll change it back to Int.
I think there is a legit problem here. Imagine totalParts is close to Int.MaxValue, and imagine partsScanned is close to totalParts. Adding p.size to it below could cause it to roll over. I think this change is needed.
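To make the roll-over concern concrete, here is a tiny example with hypothetical numbers (not Spark code):

```scala
// Int arithmetic wraps silently: a counter already near Int.MaxValue that has
// another batch of partition counts added to it goes negative, which would
// break a loop condition such as `while (partsScanned < totalParts)`.
object IntOverflowDemo extends App {
  val partsScanned: Int = Int.MaxValue - 5
  val batchSize: Int = 10
  println(partsScanned + batchSize) // prints -2147483644 rather than 2147483652
}
```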
That's never possible -- if we have anywhere near 2B partitions, the scheduler won't be fast enough to schedule them. As a matter of fact, if we have anything larger than a few million, the scheduler will likely crash.
Fair point: in practice this all but certainly won't happen. Note that this patch was already committed to master, making this a Long. It doesn't hurt and is, at least theoretically, more correct locally. I suppose it isn't worth updating again, but I do not feel strongly about it.
I'd prefer to change it back since it is so little work, so this does not start a trend of changing all Ints to Longs for no reason. Note that this also raises questions about why this value could ever be greater than Int.MaxValue when somebody reads this code in the future.
Also @srowen, even if totalParts is close to Int.MaxValue, I don't think partsScanned can be greater than Int.MaxValue, because we never scan more parts than the number of parts available.
Ah, OK, you were referring to partsScanned + numPartsToTry. We should just cast that to Long to minimize the impact.
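A sketch of that suggestion (variable names from the discussion above, values hypothetical): widen only the sum to Long so the intermediate result cannot wrap, while the counters themselves stay Int.

```scala
object LongCastDemo extends App {
  // Hypothetical values: each fits in an Int, but their sum does not.
  val partsScanned: Int = 2000000000
  val numPartsToTry: Int = 400000000

  val wrapped: Int  = partsScanned + numPartsToTry        // overflows to -1894967296
  val widened: Long = partsScanned.toLong + numPartsToTry // 2400000000, as intended

  println(s"Int sum:  $wrapped")
  println(s"Long sum: $widened")
}
```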
@QiangCai it would be great if you could submit a new pull request to address the comments. Thanks.
This is a follow-up for the original patch apache#10562.
This is a follow-up for the original patch #10562.
Author: Reynold Xin <rxin@databricks.com>
Closes #10670 from rxin/SPARK-12340.
I have closed pull request #10487, and I have created this pull request to resolve the problem.
Spark JIRA:
https://issues.apache.org/jira/browse/SPARK-12340